Document clustering using synthetic cluster prototypes

Authors

  • Argyris Kalogeratos
  • Aristidis Likas
Abstract

Article history: Received 17 December 2009; received in revised form 11 December 2010; accepted 13 December 2010; available online 24 December 2010.

The use of centroids as prototypes for clustering text documents with the k-means family of methods is not always the best choice for representing text clusters, due to the high dimensionality, sparsity, and low quality of text data. Especially in cases where we seek clusters with a small number of objects, the use of centroids may lead to poor solutions near the bad initial conditions. To overcome this problem, we propose the idea of a synthetic cluster prototype that is computed by first selecting a subset of cluster objects (instances), then computing the representative of these objects, and finally selecting important features. In this spirit, we introduce the MedoidKNN synthetic prototype, which favors the representation of the dominant class in a cluster. These synthetic cluster prototypes are incorporated into the generic spherical k-means procedure, leading to a robust clustering method called k-synthetic prototypes (k-sp). Comparative experimental evaluation demonstrates the robustness of the approach, especially for small datasets and clusters overlapping in many dimensions, and its superior performance against traditional and subspace clustering methods. © 2010 Elsevier B.V. All rights reserved.
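The three steps named in the abstract (select a subset of cluster objects, compute their representative, keep the important features) can be illustrated with a minimal Python sketch. The abstract does not give the exact rules, so everything specific here is an assumption made for illustration: the subset is taken to be the cluster medoid plus its nearest neighbours under cosine similarity, the representative is their mean, and feature selection keeps the largest-weight features. The names `normalize_rows`, `synthetic_prototype`, `n_neighbors`, and `n_features` are hypothetical and do not come from the paper.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def synthetic_prototype(X_cluster, n_neighbors=2, n_features=2):
    """Illustrative synthetic prototype for one cluster.

    X_cluster: array whose rows are unit-length document vectors.
    The selection rules below are assumptions, not the paper's definition.
    """
    sims = X_cluster @ X_cluster.T              # pairwise cosine similarities
    medoid = int(np.argmax(sims.sum(axis=1)))   # document most similar to the rest

    # Step 1 (assumed rule): the medoid plus its most similar cluster members.
    order = [i for i in np.argsort(-sims[medoid]) if i != medoid]
    subset = [medoid] + order[:n_neighbors]

    # Step 2: the representative is the mean of the selected documents.
    rep = X_cluster[subset].mean(axis=0)

    # Step 3 (assumed rule): keep only the largest-weight features.
    keep = np.argsort(-rep)[:n_features]
    proto = np.zeros_like(rep)
    proto[keep] = rep[keep]

    # Unit length, so the result can be compared by cosine similarity
    # as in spherical k-means.
    norm = np.linalg.norm(proto)
    return proto / norm if norm > 0 else proto

# Toy cluster: three documents sharing features 0 and 1, plus one outlier.
docs = normalize_rows(np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]))
prototype = synthetic_prototype(docs, n_neighbors=2, n_features=2)
print(prototype)
```

In this toy cluster the outlier document is not among the medoid's nearest neighbours, so it does not influence the prototype, which is the kind of dominant-class behaviour the abstract describes. In a k-sp style procedure, a function like this would stand in for the centroid update of spherical k-means.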

Similar resources

A supervised growing neural gas algorithm for cluster analysis

In this paper, a prototype-based supervised clustering algorithm is proposed. The proposed algorithm, called the Supervised Growing Neural Gas algorithm (SGNG), incorporates several techniques from some unsupervised GNG algorithms such as the adaptive learning rates and the cluster repulsion mechanisms of the Robust Growing Neural Gas algorithm, and the Type Two Learning Vector Quantization (LV...

Computational Intelligence Methods for Clustering of Sense Tagged Nepali Documents

This paper presents a method that hybridizes the self-organizing map (SOM), particle swarm optimization (PSO), and the k-means clustering algorithm for document clustering. Document representation is an important step for clustering purposes. The common way to represent a text is the bag-of-words approach. This approach is simple but has two drawbacks, viz. synonymy and polysemy, which arise because of...

Wasserstein Metric Based Adaptive Fuzzy Clustering Methods for Symbolic Data

Given the current limitations of fuzzy clustering metrics, the aim of this paper is to present new Wasserstein metric based adaptive fuzzy clustering methods for partitioning symbolic interval data. The Wasserstein metric shows advantages in extracting distribution information from symbolic interval data. In addition, the proposed fuzzy clustering methods emphasize the correlation structure between indices...

Web mining with relational clustering

Clustering is an unsupervised learning method that determines partitions and (possibly) prototypes from pattern sets. Sets of numerical patterns can be clustered by alternating optimization (AO) of clustering objective functions or by alternating cluster estimation (ACE). Sets of non–numerical patterns can often be represented numerically by (pairwise) relations. These relational data sets can ...

An Accelerated MapReduce-Based K-prototypes for Big Data

Big data are often characterized by a huge volume and a variety of attributes, namely numerical and categorical. To address this issue, this paper proposes an accelerated MapReduce-based k-prototypes method. The proposed method uses a pruning strategy to accelerate the clustering process by reducing unnecessary distance computations between cluster centers and data points. Experiments ...

Journal title:
  • Data Knowl. Eng.

Volume 70  Issue 

Pages  -

Publication date 2011